ROSSMANN Stores Sales Predictions¶
Summary¶
- Context
- Challenge
- Solution Development
- Conclusion and Demonstration
- Next Steps
1. Context¶
- Monthly Results Meeting
- CFO asked for a Sales Forecast for the Next 6 Weeks for each Store
2. Challenge¶
Solution¶
Using Machine Learning to forecast sales for all stores
Sales predictions can be viewed on a smartphone.
3. Solution Development¶
Data Description¶
Data Dimension¶
Number of Rows: 1017209
Number of Cols: 18
Descriptive Statistics¶
| | attributes | min | max | range | mean | median | std | skew | kurtosis |
|---|---|---|---|---|---|---|---|---|---|
| 0 | store | 1.0 | 1115.0 | 1114.0 | 558.429727 | 558.0 | 321.908493 | -0.000955 | -1.200524 |
| 1 | day_of_week | 1.0 | 7.0 | 6.0 | 3.998341 | 4.0 | 1.997390 | 0.001593 | -1.246873 |
| 2 | sales | 0.0 | 41551.0 | 41551.0 | 5773.818972 | 5744.0 | 3849.924283 | 0.641460 | 1.778375 |
| 3 | customers | 0.0 | 7388.0 | 7388.0 | 633.145946 | 609.0 | 464.411506 | 1.598650 | 7.091773 |
| 4 | open | 0.0 | 1.0 | 1.0 | 0.830107 | 1.0 | 0.375539 | -1.758045 | 1.090723 |
| 5 | promo | 0.0 | 1.0 | 1.0 | 0.381515 | 0.0 | 0.485758 | 0.487838 | -1.762018 |
| 6 | school_holiday | 0.0 | 1.0 | 1.0 | 0.178647 | 0.0 | 0.383056 | 1.677842 | 0.815154 |
| 7 | competition_distance | 20.0 | 200000.0 | 199980.0 | 5935.442677 | 2330.0 | 12547.646829 | 10.242344 | 147.789712 |
| 8 | competition_open_since_month | 1.0 | 12.0 | 11.0 | 6.786849 | 7.0 | 3.311085 | -0.042076 | -1.232607 |
| 9 | competition_open_since_year | 1900.0 | 2015.0 | 115.0 | 2010.324840 | 2012.0 | 5.515591 | -7.235657 | 124.071304 |
| 10 | promo2 | 0.0 | 1.0 | 1.0 | 0.500564 | 1.0 | 0.500000 | -0.002255 | -1.999999 |
| 11 | promo2_since_week | 1.0 | 52.0 | 51.0 | 23.619033 | 22.0 | 14.310057 | 0.178723 | -1.184046 |
| 12 | promo2_since_year | 2009.0 | 2015.0 | 6.0 | 2012.793297 | 2013.0 | 1.662657 | -0.784436 | -0.210075 |
| 13 | is_promo | 0.0 | 1.0 | 1.0 | 0.155231 | 0.0 | 0.362124 | 1.904152 | 1.625796 |
Rossmann was founded in 1972. Values of competition_open_since_year earlier than 1972 indicate the years in which the nearest competitors, from other pharmacy chains, opened.
The competition_distance variable has high positive skew and kurtosis, indicating a right-skewed distribution with a heavy tail.
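The statistics in the table can be sketched in plain Python as follows. This is a minimal illustration, not the notebook's own code: `describe` is a hypothetical helper, the sample values are made up to show a long right tail, and the moments are the population (biased) versions, so library implementations that apply a bias correction may give slightly different numbers.

```python
import math
import statistics


def describe(values):
    """Summary statistics for one numeric column (hypothetical helper)."""
    n = len(values)
    mean = statistics.fmean(values)
    # Population (biased) central moments of order 2, 3 and 4.
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return {
        "min": min(values),
        "max": max(values),
        "range": max(values) - min(values),
        "mean": mean,
        "median": statistics.median(values),
        "std": math.sqrt(m2),
        "skew": m3 / m2 ** 1.5,
        "kurtosis": m4 / m2 ** 2 - 3.0,  # excess kurtosis (normal = 0)
    }


# Illustrative values only (not taken from the Rossmann data):
# most distances are small and one is very large.
distances = [20.0, 150.0, 300.0, 700.0, 1200.0, 2300.0, 5000.0, 75000.0]
stats = describe(distances)
print(stats)
```

A single large value pulls the mean well above the median and makes both skew and kurtosis positive, which is the pattern described for competition_distance. For a balanced 0/1 column the same formula gives an excess kurtosis of -2, consistent with the promo2 row.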
The feature ranges differ widely. In models that are sensitive to feature magnitude, features with larger numeric ranges can dominate training simply because of their scale. The features therefore need to be rescaled before modelling.
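Two common rescaling options are sketched below in plain Python. Which scaler the notebook applied to which feature is not stated here, so this is an assumption-laden illustration: `min_max_scale` and `robust_scale` are hypothetical helpers and the sample values are made up.

```python
import statistics


def min_max_scale(values):
    """Map values linearly onto the interval [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def robust_scale(values):
    """Center on the median and divide by the interquartile range."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    median = statistics.median(values)
    iqr = q3 - q1
    return [(v - median) / iqr for v in values]


# Illustrative values only: one extreme value dominates the range.
distances = [20.0, 150.0, 300.0, 700.0, 1200.0, 2300.0, 5000.0, 75000.0]

scaled_min_max = min_max_scale(distances)
scaled_robust = robust_scale(distances)
print(scaled_min_max)
print(scaled_robust)
```

With min-max scaling the single extreme value compresses all other points close to 0, while median and interquartile range are barely affected by it. That is one reason a robust scaler is often considered for heavily skewed features such as competition_distance.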
Mind Map Hypothesis¶
Exploratory Analysis Hypotheses¶
H1. On average, stores with a larger assortment should sell more.
H2. Stores with closer competitors should sell less.
H3. Stores with competitors that have been around for longer should sell more.
H4. Stores with more consecutive promotions should sell more than stores with regular promotions.
H5. Stores open during the Christmas holidays should sell more.
H6. Stores should sell more over the years.
H7. Stores should sell more in the second half of the year.
H8. Stores should sell less after the 10th of each month.
H9. Stores should sell more on average on weekends.
H10. Stores should sell less during school holidays.
Exploratory Data Analysis¶
Response Variable¶
Numerical Variables¶
Categorical Variables¶
Hypothesis Validation¶
H1. On average, stores with a larger assortment should sell more.¶
True. Stores with a larger assortment sell more on average.
H4. Stores with more consecutive promotions should sell more than stores with regular promotions.¶
False. Stores with more consecutive promotions sell less.
H9. Stores should sell more on average on weekends.¶
False, there is not enough evidence to conclude that sales on weekends are greater than sales on weekdays.
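A first look at a hypothesis like H9 is a comparison of group means. The sketch below uses made-up rows and assumes that day_of_week values 6 and 7 are Saturday and Sunday; `mean_sales_by_group` is a hypothetical helper, and a difference in means alone is not a significance test.

```python
from collections import defaultdict
from statistics import fmean

# Illustrative (day_of_week, sales) rows only; not taken from the dataset.
rows = [
    (1, 6200), (2, 5800), (3, 5600), (4, 5700),
    (5, 6100), (6, 5900), (7, 5400),
]


def mean_sales_by_group(rows):
    """Average sales for weekdays and weekends."""
    groups = defaultdict(list)
    for day_of_week, sales in rows:
        # Assumption: 6 = Saturday, 7 = Sunday.
        label = "weekend" if day_of_week >= 6 else "weekday"
        groups[label].append(sales)
    return {label: fmean(values) for label, values in groups.items()}


averages = mean_sales_by_group(rows)
print(averages)
```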
Summary of Hypotheses¶
| Hypotheses | Conclusion | Relevance |
|---|---|---|
| H1 | True | High |
| H2 | False | Low |
| H3 | False | Low |
| H4 | False | High |
| H5 | True | High |
| H6 | True | Low |
| H7 | True | High |
| H8 | True | High |
| H9 | False | High |
| H10 | False | High |
Multivariate Analysis¶
Numerical Attributes¶
Categorical Attributes¶
Machine Learning Modelling¶
Compare Model's Performance¶
| | Model Name | MAE | MAPE | RMSE |
|---|---|---|---|---|
| 0 | XGB Regressor | 694.066535 | 0.102831 | 999.914943 |
| 1 | Random Forest Regressor | 747.458229 | 0.111702 | 1098.595402 |
| 2 | Average Model | 1429.763326 | 0.216814 | 1939.328730 |
| 3 | Linear Regression | 1867.623495 | 0.296267 | 2657.022835 |
| 4 | Linear Regression - Lasso | 2192.664126 | 0.343490 | 3092.842416 |
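The three error metrics in the table have standard definitions, sketched here in plain Python. The function names and sample values are illustrative, and MAPE is returned as a fraction (0.10 means 10%), which appears to match the scale of the MAPE column.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error, in the same unit as the target."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mape(y_true, y_pred):
    """Mean absolute percentage error as a fraction; y_true must be nonzero."""
    return sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error; penalizes large errors more than MAE."""
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


# Illustrative daily sales and predictions only.
y_true = [5000.0, 6000.0, 8000.0, 4000.0]
y_pred = [5500.0, 5700.0, 8400.0, 3600.0]

print(mae(y_true, y_pred), mape(y_true, y_pred), rmse(y_true, y_pred))
```

Because MAPE divides by the actual value, rows with zero sales (for example, days when a store is closed) have to be excluded before it is computed.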
4. Conclusion and Demonstration¶
Business Performance¶
| | store | predictions | worst_scenario | best_scenario | MAE | MAPE |
|---|---|---|---|---|---|---|
| 291 | 292 | 108,383.78 | 105,018.11 | 111,749.45 | 3,365.67 | 0.61 |
| 908 | 909 | 244,502.50 | 237,037.94 | 251,967.06 | 7,464.56 | 0.52 |
| 875 | 876 | 200,492.30 | 196,658.64 | 204,325.96 | 3,833.66 | 0.29 |
| 721 | 722 | 356,656.97 | 354,564.33 | 358,749.61 | 2,092.64 | 0.28 |
| 594 | 595 | 383,771.44 | 379,840.58 | 387,702.29 | 3,930.85 | 0.27 |
| ... | ... | ... | ... | ... | ... | ... |
| 493 | 494 | 326,901.41 | 326,470.16 | 327,332.65 | 431.25 | 0.06 |
| 373 | 374 | 258,871.70 | 258,486.16 | 259,257.24 | 385.54 | 0.06 |
| 561 | 562 | 747,947.50 | 746,976.29 | 748,918.71 | 971.21 | 0.06 |
| 958 | 959 | 255,862.62 | 255,458.45 | 256,266.80 | 404.17 | 0.05 |
| 259 | 260 | 232,898.20 | 232,580.29 | 233,216.12 | 317.91 | 0.05 |
1115 rows × 6 columns
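The scenario columns are consistent with subtracting the store's MAE from the prediction for the worst scenario and adding it for the best scenario. A minimal sketch of that relationship, using the figures reported for store 292 (`scenarios` is a hypothetical helper, not the notebook's code):

```python
def scenarios(predictions, mae):
    """Worst and best scenario as prediction minus and plus the MAE."""
    return {
        "worst_scenario": round(predictions - mae, 2),
        "best_scenario": round(predictions + mae, 2),
    }


# Values reported for store 292 in the table above.
store_292 = scenarios(predictions=108383.78, mae=3365.67)
print(store_292)
```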
Total Performance¶
| | Scenario | Values |
|---|---|---|
| 0 | predictions | US$290,943,413.95 |
| 1 | worst_scenario | US$290,165,597.86 |
| 2 | best_scenario | US$291,721,230.03 |
Machine Learning Performance¶
Demonstration - Telegram¶
5. Next Steps¶
- Model Workshop for Business Users
- Collect Usability Feedback
- Improve model performance (MAPE) by 5%